
Fix CI #389

Merged: 88 commits merged into master from fix-ci on Jan 25, 2024

Conversation

timothystewart6 (Contributor) commented Oct 29, 2023

Proposed Changes

I don't even know where to begin 😅

  • GitHub no longer seems to allow us to use nested virtualization on the macos-12 image
  • We need nested virtualization in order to create VMs during CI to run our playbook
  • Nested virtualization isn't enabled on macos-13
  • They claim it's enabled for large runner instances
    • These are paid
    • They are Ubuntu, not macOS
    • I signed up, tested, and it didn't seem to work
  • Rewrote the playbook to work with Linux (Ubuntu)
  • I ended up self-hosting a runner VM
    • Nested virtualization
    • 16 cores
    • 64 GB RAM
    • 100 GB SSD
    • Due to nested virtualization it's still slow, but stable
    • Another downside is that you can only run one job at a time on a self-hosted runner. GHA really needs to start using containers like every other CI out there instead of stateful VMs.
      • If this becomes a problem I will break the one I have into 3. It doesn't use the resources I gave it effectively or efficiently right now, probably due to nested virtualization.
  • I ended up upgrading all dependencies while trying to make this work with GitHub's macOS images
  • Our days were numbered with macOS anyway; VirtualBox isn't supporting Apple Silicon
  • I should switch to Vagrant + VirtIO at some point.

I also used this PR to rework the CI steps and caching. While troubleshooting this CI job I found that it was better to cache things up front in a pre step rather than letting each step decide; downstream steps only restore the cache. There was also a race condition: all 3 molecule tests could each download all of the VMs and then each save them to the cache. Caching up front and limiting downstream jobs to restore-only ensures there is no unnecessary save later.
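Roughly, the shape is this: a minimal sketch assuming GitHub Actions on a self-hosted runner, where the job names, scenario names, box, and cache key are placeholders rather than the exact workflow:

```yaml
# Illustrative sketch only -- job names, scenarios, box, and cache key are placeholders.
jobs:
  cache-dependencies:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - name: Cache Vagrant boxes (only this job ever saves)
        uses: actions/cache@v4
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
      - name: Download boxes on a cache miss
        run: vagrant box add generic/ubuntu2204 --provider virtualbox || true

  molecule:
    needs: cache-dependencies
    runs-on: self-hosted
    strategy:
      matrix:
        scenario: [default, ipv6, single_node]
    steps:
      - uses: actions/checkout@v4
      - name: Restore boxes only (downstream jobs never save)
        uses: actions/cache/restore@v4
        with:
          path: ~/.vagrant.d/boxes
          key: vagrant-boxes-${{ hashFiles('**/molecule.yml') }}
      - name: Run a molecule scenario
        run: molecule test --scenario-name ${{ matrix.scenario }}
```

Because `actions/cache/restore` never writes back, only the pre step can populate the cache, which removes the duplicate-save race.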

All of this should work. I am going to merge it shortly after the tests pass (again) so that we can get caught up on PRs.

Checklist

  • Tested locally
  • Ran site.yml playbook
  • Ran reset.yml playbook
  • Did not add any unnecessary changes
  • Ran pre-commit install at least once before committing
  • 🚀

timothystewart6 changed the title from "Fix ci" to "Fix CI" on Oct 29, 2023
timothystewart6 enabled auto-merge (squash) on October 29, 2023 at 20:47
sleiner (Contributor) commented Nov 1, 2023

@timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:

  • Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machine that the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and set the runner up to only accept one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I see, it only increases the time until all CI results are actually available.
  • Concerning the Vagrant cache: I added this because (at the time at least) the GitHub runners were frequently (~30-50% of the time) getting HTTP 429: Too Many Requests responses from Vagrant Cloud. So the cache improved reliability (and we were better citizens towards Vagrant Cloud) at the time. If that problem is not acute anymore, it is safe to remove the cache IMO.

timothystewart6 added the "do not merge" label on Nov 1, 2023
timothystewart6 (Contributor Author) commented:

> @timothystewart6 I have just seen this - and I am not convinced that this would actually improve CI in its current form:
>
>   • Why would running the Molecule jobs serially improve reliability? We are using GitHub-hosted runners, so we have no control over the actual machine that the jobs run on. After checking the docs, I am very confident that "scheduling jobs from the same workflow" is not a feature that GitHub Actions currently has. You can pin self-hosted runners to an organization or repo, or require certain labels for specific jobs, so if we had a self-hosted runner, we could set GitHub Actions up to use only that one and set the runner up to only accept one job at a time. But for GitHub-hosted runners, everything goes into the same pool of shared runners, so limiting our concurrency does not help us. Instead, as far as I see, it only increases the time until all CI results are actually available.
>   • Concerning the Vagrant cache: I added this because (at the time at least) the GitHub runners were frequently (~30-50% of the time) getting HTTP 429: Too Many Requests responses from Vagrant Cloud. So the cache improved reliability (and we were better citizens towards Vagrant Cloud) at the time. If that problem is not acute anymore, it is safe to remove the cache IMO.

Thanks @sleiner, I am coming to the same conclusion too. This was more of a debugging experiment to track down build errors. Our error seems to be a VirtualBox error.

  • The first step was to run jobs serially to see if concurrency was an issue, since we are using hosted runners. My suspicion was that we might be scheduled on the same host and that somehow there was a conflict between two VirtualBox instances running.
  • Next was to remove the build caches (I manually cleared them from the runners).
  • Then I removed the cache job from CI to see if we might have cached a bad state for the Vagrant VMs.

This PR will not be merged yet; I am just working through some of my suspicions, and I have a few more to chase down.

VirtualBox Error

    fatal: [localhost]: FAILED! => {"changed": false, "cmd": ["/usr/local/bin/vagrant", "up", "--no-provision"], "msg": "Failed to start the VM(s): See log file '/Users/runner/.cache/molecule/k3s-ansible/default/vagrant.err'", "rc": 1}

    stderr:
    ### 2023-10-30 21:06:38 ###
    ### 2023-10-30 21:06:38 ###
    ### 2023-10-30 21:06:38 ###
    ### 2023-10-30 21:06:38 ###
    ### 2023-10-30 21:06:38 ###
    ### 2023-10-30 21:06:55 ###
    ### 2023-10-30 21:06:55 ###
    ### 2023-10-30 21:06:55 ###
    ### 2023-10-30 21:06:55 ###
    ### 2023-10-30 21:06:55 ###
    ### 2023-10-30 21:06:55 ###
    The guest machine entered an invalid state while waiting for it
    to boot. Valid states are 'starting, running'. The machine is in the
    'gurumeditation' state. Please verify everything is configured
    properly and try again.

    If the provider you're using has a GUI that comes with it,
    it is often helpful to open that and watch the machine, since the
    GUI often has more helpful error messages than Vagrant can retrieve.
    For example, if you're using VirtualBox, run `vagrant up` while the
    VirtualBox GUI is open.

    The primary issue for this error is that the provider you're using
    is not properly configured. This is very rarely a Vagrant issue.

sleiner (Contributor) commented Nov 1, 2023

From the error message that you posted:

> The guest machine entered an invalid state while waiting for it
> to boot. Valid states are 'starting, running'. The machine is in the
> 'gurumeditation' state. Please verify everything is configured
> properly and try again.

That's a trippy error message if I've ever seen one 😅

> If the provider you're using has a GUI that comes with it,
> it is often helpful to open that and watch the machine, since the
> GUI often has more helpful error messages than Vagrant can retrieve.
> For example, if you're using VirtualBox, run `vagrant up` while the
> VirtualBox GUI is open.

Hmm, accessing the GUI will not be possible, but maybe we can get the actual VirtualBox logs out?
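One low-effort option (just a sketch; the VBox.log glob is an assumption about where VirtualBox writes its per-VM logs on the runner) would be to upload the log files as a workflow artifact whenever a job fails:

```yaml
# Sketch: collect Vagrant/VirtualBox logs when a molecule job fails.
# The VBox.log path is an assumption about VirtualBox's default log location.
- name: Upload VM logs on failure
  if: failure()
  uses: actions/upload-artifact@v3
  with:
    name: vm-logs
    path: |
      ~/.cache/molecule/k3s-ansible/default/vagrant.err
      ~/VirtualBox VMs/*/Logs/VBox.log
    if-no-files-found: warn
```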

sleiner (Contributor) commented Nov 1, 2023

@timothystewart6 btw if you are testing runner types, it might be worthwhile to check out macos-13. You even get 33 % more cores (4 instead of 3) 😄

timothystewart6 (Contributor Author) commented Nov 1, 2023

@sleiner good eye! I only saw latest/12 and tested that. I will also try to get more logs out of VirtualBox later!

timothystewart6 (Contributor Author) commented Nov 1, 2023

Ah, vagrant doesn't seem to be installed on macos-13; we might need to `brew install` it. I feel like at that point the runner should just be on Ubuntu, since it makes a self-hosted version possible at some point (unless I buy a Mac mini 😅).

sleiner (Contributor) commented Nov 1, 2023

> at that point the runner should just be on ubuntu since it makes a self-hosted version possible at some point (unless I buy a mac mini 😅)

In principle, porting this to Ubuntu is trivial (since Molecule, Vagrant, and VirtualBox all work on Ubuntu too). In GitHub Actions, however, we are using macOS because at the time these were the only GitHub-hosted runners with support for nested virtualization. In the meantime, GitHub seems to have added KVM support to large runners (i.e., the ones that are not free to use). So as far as I see, among the free GitHub-hosted runners, macOS is the only viable option.
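If it helps while testing runner types, a small guard step (a sketch, not something in the current workflow) can show up front whether a given Linux runner exposes hardware virtualization at all:

```yaml
# Sketch: fail fast if the runner does not expose hardware virtualization,
# instead of letting VirtualBox die later with an opaque guest error.
- name: Check virtualization support on the runner
  run: |
    # vmx (Intel) / svm (AMD) CPU flags mean virtualization is exposed to this guest
    if grep -Eq '(vmx|svm)' /proc/cpuinfo; then
      echo "Hardware virtualization flags present"
    else
      echo "::error::No virtualization flags exposed on this runner"
      exit 1
    fi
    # /dev/kvm only matters for KVM/libvirt; VirtualBox uses its own kernel module
    ls -l /dev/kvm || true
```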

timothystewart6 (Contributor Author) commented:

Yeah, I just read through all of those threads. Seems like enabling it on macOS was an oversight. I wonder if that's why vagrant isn't installed on 13, maybe they "fixed the bug" 😆. I'll keep plowing through this and figure it out!

joaopedrocg27 (Contributor) commented Dec 7, 2023

Hey

First of all, let me say that I am not familiar with Molecule, but I tried to run the test workflow on my fork and it looks like this is facing some resource starvation.

One of the steps ran without an issue: Run Link

Are we provisioning several VMs at the same time? If so, could we split those workloads (at least just to test)?
I am getting a Disconnect error, which means that the runner service died. I've had this happen when the host ran out of memory or had high CPU load (we fixed those with a higher nice value, but I doubt that GitHub-managed runners allow that easily).

It could also be the cache, since it only worked the first time.

Hope that helps

joaopedrocg27 (Contributor) commented Dec 7, 2023

According to this:

> It's getting a triple fault error, which possibly means disk corruption

So that's another vote for the cache being the culprit.

timothystewart6 (Contributor Author) commented:

I think I found the answer: we are running out of resources during CI. Reducing the VM resource allocation seemed to fix it on the first attempt. I have a feeling that sometimes we would get lucky and land on a macOS machine that had enough resources to run our CI; most other times it would not.

Similar Issue (it's not just us)
actions/runner-images#8730

After that change, it worked on the first try.
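For context, the change is just trimming the per-VM allocation in the Molecule platform definition. A minimal sketch, assuming the Molecule Vagrant driver's memory/cpus platform keys; the platform name, box, and numbers are illustrative, not the values actually committed:

```yaml
# molecule.yml (excerpt) -- illustrative values only
driver:
  name: vagrant
  provider:
    name: virtualbox
platforms:
  - name: control1
    box: generic/ubuntu2204
    memory: 2048   # smaller per-VM allocation so the VMs don't starve the runner
    cpus: 2
```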

timothystewart6 (Contributor Author) commented:

I am most likely going to add some additional changes to this job.

  1. Create a cache-dependencies pre-step. Caching dependencies while the jobs run in parallel is a race condition.

  2. Update to the latest macOS runner version. This might mean installing Vagrant and VirtualBox ourselves. Details of the image here

  3. I might consider running these in serial to limit runner usage. This does not change anything functionally but is really about being a good neighbor. Also, if one fails, there's no need to run the others (see the sketch below).
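For the third point, a matrix with max-parallel set to 1 would keep the existing job structure but run one Molecule scenario at a time and stop early on failure. A sketch only; the scenario names and runner label are placeholders:

```yaml
# Sketch: run one molecule scenario at a time and skip the rest if one fails.
jobs:
  molecule:
    runs-on: macos-latest
    strategy:
      max-parallel: 1   # one scenario at a time, easier on the runners
      fail-fast: true   # if one fails, the remaining scenarios are cancelled
      matrix:
        scenario: [default, ipv6, single_node]
    steps:
      - uses: actions/checkout@v4
      - run: molecule test --scenario-name ${{ matrix.scenario }}
```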

timothystewart6 removed the "do not merge" label on Jan 21, 2024
timothystewart6 (Contributor Author) commented Jan 23, 2024

After working on this for about 2 days straight, I think I finally understand what's going on

actions/runner-images#8642

Turns out nested virtualization was working on some macos-12 images, but they've slowly been replacing them with new images. This explains why sometimes our builds would pass: we got lucky and landed on one of the old images that supported it.

It also turns out nested virtualization is only supported on their large runners, which are a paid Team-plan feature. I will look into the cost, and I will also consider running self-hosted runners. CI for this is pretty important, but I'm also not sure I can afford GitHub Team when I am a solo developer.

timothystewart6 (Contributor Author) commented:

I signed up for Team, converted the workflow to Ubuntu, and provisioned a large runner, and I am still seeing the same thing. That thread has so many conflicting comments that it's hard to know what's right. Someone also mentioned that after the update to the Ubuntu runners the other day, it now works on ubuntu-latest too, so I am testing that as well. If this all fails I will self-host a runner with nested virtualization turned on.

timothystewart6 (Contributor Author) commented:

Good news: my self-hosted runners are working! I am going to clean up this PR before merging, and I will also update the summary.

timothystewart6 (Contributor Author) commented Jan 25, 2024

Updated PR description to reflect everything that I did. Going to merge after tests pass.

Here it is in action!

image

timothystewart6 merged commit e2e9881 into master on Jan 25, 2024 (6 checks passed)
timothystewart6 deleted the fix-ci branch on January 25, 2024 at 04:26
timothystewart6 (Contributor Author) commented:

Just reporting back: we now have 3 self-hosted runners, so jobs can run concurrently.

timothystewart6 (Contributor Author) commented:

image
